The article introduces Fast-dLLM, a method that accelerates diffusion-based large language models (LLMs) through a block-wise approximate Key-Value (KV) cache and a confidence-aware parallel decoding strategy. Together, these address the slow inference of diffusion LLMs and mitigate the quality degradation that naive parallel token decoding causes, yielding up to 27.6x higher throughput while maintaining accuracy and making practical deployment of diffusion LLMs more feasible.
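To make the confidence-aware idea concrete, below is a minimal, hypothetical sketch of one parallel decoding step for a masked diffusion LLM: only masked positions whose top-1 probability clears a threshold are committed in this step, while the rest are left for later denoising steps. Function and parameter names (`decode_step`, `confidence_threshold`) are illustrative assumptions, not Fast-dLLM's actual API.

```python
# Illustrative sketch only, not Fast-dLLM's implementation.
import torch

def decode_step(logits: torch.Tensor,
                tokens: torch.Tensor,
                mask_id: int,
                confidence_threshold: float = 0.9) -> torch.Tensor:
    """Commit, in parallel, only the masked positions whose top-1
    probability exceeds the confidence threshold.

    logits: (seq_len, vocab_size) model outputs for the current block
    tokens: (seq_len,) current sequence with mask_id at undecided positions
    """
    probs = torch.softmax(logits, dim=-1)
    confidence, candidates = probs.max(dim=-1)   # top-1 prob and token per position
    masked = tokens == mask_id                   # positions still undecided
    accept = masked & (confidence >= confidence_threshold)
    # Commit at least one token per step so decoding always makes progress,
    # even when no position clears the threshold.
    if masked.any() and not accept.any():
        best = torch.where(masked, confidence,
                           torch.full_like(confidence, -1.0)).argmax()
        accept[best] = True
    new_tokens = tokens.clone()
    new_tokens[accept] = candidates[accept]
    return new_tokens
```

In this sketch, the threshold trades speed for fidelity: a lower threshold commits more tokens per step (faster, riskier), while a higher one defers uncertain positions to later steps.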
Tags: diffusion, language models, parallel decoding